\chapter{Latent variable models for discrete data}


\section{Introduction}
In this chapter, we are concerned with latent variable models for discrete data, such as bit vectors, sequences of categorical variables, count vectors, graph structures, relational data, etc. These models can be used to analyze voting records, text and document collections, low-intensity images, movie ratings, etc. However, we will mostly focus on text analysis, and this will be reflected in our terminology.

Since we will be dealing with so many different kinds of data, we need some precise notation to keep things clear. When modeling variable-length sequences of categorical variables (i.e., symbols or \textbf{tokens}), such as words in a document, we will let $y_{il} \in \{1,\cdots,V\}$ represent the identity of the $l$'th word in document $i$, where $V$ is the number of possible words in the vocabulary. We assume $l=1:L_i$, where $L_i$ is the (known) length of document $i$, and $i=1:N$, where $N$ is the number of documents.

We will often ignore the word order, resulting in a \textbf{bag of words}. This can be reduced to a fixed-length vector of counts (a histogram). We will use $n_{iv} \in \{0,1,\cdots,L_i\}$ to denote the number of times word $v$ occurs in document $i$, for $v=1:V$. Note that the $N \times V$ count matrix $\vec{N}$ is often large but sparse, since we typically have many documents, but most words do not occur in any given document.
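
As a small illustration of this reduction (the numbers below are hypothetical, chosen only for the example), the counts can be written in terms of the tokens as
\[
n_{iv} = \sum_{l=1}^{L_i} \mathbf{1}[y_{il} = v],
\]
where $\mathbf{1}[\cdot]$ equals 1 if its argument is true and 0 otherwise. Suppose $V=4$ and document $i$ has $L_i=5$ tokens, $(y_{i1},\ldots,y_{i5}) = (2,1,2,4,2)$. Then
\[
(n_{i1}, n_{i2}, n_{i3}, n_{i4}) = (1, 3, 0, 1),
\]
and the counts sum to the document length, $1+3+0+1 = 5 = L_i$. The word order is lost: any permutation of the five tokens yields the same count vector.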

In some cases, we might have multiple different bags of words, e.g., bags of text words and bags of visual words. These correspond to different ``channels'' or types of features. We will denote these by $y_{irl}$, for $r=1:R$ (the number of responses) and $l=1:L_{ir}$. If $L_{ir}=1$, it means we have a single token (a bag of length 1); in this case, we just write $y_{ir} \in \{1,\cdots,V_r\}$ for brevity. If every channel is just a single token, we write the fixed-size response vector as $y_{i,1:R}$; in this case, the $N \times R$ design matrix $\vec{Y}$ will not be sparse. For example, in social science surveys, $y_{ir}$ could be the response of person $i$ to the $r$'th multi-choice question.

Our goal is to build joint probability models of $p(\vec{y}_i)$ or $p(\vec{n}_i)$ using latent variables to capture the correlations. We will then try to interpret the latent variables, which provide a compressed representation of the data. We provide an overview of some approaches in Section 27.2 TODO, before going into more detail in later sections.


\section{Distributed state LVMs for discrete data}

